arXiv:1506.06924v1 [cs.SE] 23 Jun 2015 


F. Schweitzer, V. Nanumyan, C.J. Tessone, X. Xia: 

How do OSS communities change in number and size? 
Published in Advances in Complex Systems, 17, 1550008 (2014) 


How do OSS projects change in number and size? 

A large-scale analysis to test a model of project growth 

Frank Schweitzer, Vahan Nanumyan, Claudio J. Tessone, Xi Xia 

Chair of Systems Design, ETH Zurich, Weinbergstrasse 58, 8092 Zurich, Switzerland 

Abstract 

Established Open Source Software (OSS) projects can grow in size if new developers join, but also 
the number of OSS projects can grow if developers choose to found new projects. We discuss to what 
extent an established model for firm growth can be applied to the dynamics of OSS projects. Our 
analysis is based on a large-scale data set from SourceForge (SF) consisting of monthly data for 10 
years, for up to 360 000 OSS projects and up to 340 000 developers. Over this time period, we find 
an exponential growth both in the number of projects and developers, with a remarkable increase of 
single-developer projects after 2009. We analyze the monthly entry and exit rates for both projects 
and developers, the growth rate of established projects and the monthly project size distribution. 
To derive a prediction for the latter, we use modeling assumptions of how newly entering developers 
choose to either found a new project or to join existing ones. Our model applies only to collaborative 
projects that are deemed to grow in size by attracting new developers. We verify, by a thorough 
statistical analysis, that the Yule-Simon distribution is a valid candidate for the size distribution of 
collaborative projects except for certain time periods where the modeling assumptions no longer hold. 
We detect and empirically test the reason for this limitation, i.e., the fact that an increasing number 
of established developers found additional new projects after 2009. 


1 Introduction 

Open Source Software (OSS) communities share with other organizations, such as social online platforms 
or research and development networks [18], the feature that they are inherently dynamic because of 
the continuous entry of new members (developers, users, firms) and exit of established members. While 
this entry and exit dynamics usually resembles small perturbations that do not challenge the existence 
of the organization, it can also lead to large cascades of members leaving [3], in particular if these 
depend on the contribution of those who left. Hence, these processes have the potential to destabilize an 
organization. On the other hand, the entry-exit dynamics plays an important role in knowledge exchange 
between organizations. New members can bring new knowledge, information, skills, or methods that help 
organizations to innovate. Members leaving, on the other hand, make space for newcomers and at the 
same time transfer knowledge they gained to other organizations. 

The economist Schumpeter saw the creative destruction process induced by newcomers as an important 
element to renew, and to develop, the economic system. Consequently, economists have for a long 
time focused on the role of entry and exit of firms in industrial organization [9]. For example, they 
found positive correlations between the entry rate of firms and innovation rates [B]. A particular strand 
of research was devoted to the impact of newcomers on the size distribution of firms. This is a long-standing 
topic in industrial organization since Gibrat introduced the law of proportionate growth, 
i.e., $\dot{x} = \beta x$, where $x$ represents the firm size as measured by the number of employees, to explain the 
empirical size distribution of firms. His assumptions lead to a log-normal distribution which is valid only 
if the number of firms is kept constant. An important extension was made by Simon [12], who combined 
the model of proportionate growth with assumptions about firms’ entry. This yields another type of skew 
size distribution, which he called the Yule distribution, but which is now commonly called the Yule-Simon 
distribution. It is characterized by a power-law tail, $f(x) \propto x^{-\gamma}$, for large values of $x$. 

The debate about whether the firm size distribution is best described by a log-normal, Yule-Simon, or a 
power-law distribution is still ongoing and the answer largely depends on the dataset analyzed. Therefore, 
we focus more on the theoretical insights obtained from these investigations. In particular, we ask to what 
extent an economic model, i.e., the Simon model for the entry and subsequent growth of firms, can be 
utilized to describe the dynamics in other types of social organizations, for example OSS communities. 

It would indeed add to the importance of the Simon model if we find that it also describes the empirical 
findings in the dynamics of OSS communities. On the other hand, a formal model of the entry dynamics 
and growth of OSS communities which focuses on the choice of developers is a rather new and important 
contribution to better understand the complex processes in socio-technical systems. Precisely, the 
novel contribution of our paper is not in the development of the model, but in the discussion to what 
extent an existing economic model describes the dynamics in OSS communities and how it could be 
extended for this purpose. 

The methodological approach to test a model of firm growth for OSS communities is based on some 
implicit analogies. With OSS community, we refer to a specific platform that hosts possibly hundreds of 
thousands of OSS projects, such as Sourceforge.net or Github.com. I.e., the community is comprised 
of projects each of which has a number of developers contributing. We note that such a system is best 
described as a bipartite network, as discussed in Sect. 2.4. Continuing with the analogy to industrial 
organization, this system is equivalent to a particular industrial sector (also called market). This market 
is comprised of thousands of firms each of which has a number of employees. The size of the firm is given 
by the number of employees, as the size of the project is given by the number of developers. 

With respect to the dynamics, we observe a continuous entry of new firms/projects into the 
market/platform, but likewise also a continuous exit, e.g., if firms go bankrupt or projects collapse. But it 
is not the firms/projects that drive the dynamics. The real drivers are the underlying constituting elements, 
i.e., the employees/developers that create new firms/projects or join existing ones, or decide to quit. 
This leaves a considerable degree of freedom. Employees/developers usually decide individually which 
firm/project to join or whether to establish a new firm/project. Only the latter choice leads to an 
increase in the total number of firms/projects, while the former still results in an increase in the size of a 
given firm/project. Interestingly, on the system’s level this individual choice can be described by a certain 
probability to found a new firm/project, which is constant and the same across the population. While 
this does not reflect the individual motivation, it is sufficient to describe the dynamics on the system’s 
level. 

Hence, with a focus on the OSS community we are not so much interested in the individual dynamics 
of specific projects, which would be better captured in case studies [30]. Instead, we want to investigate 
systemic properties that result from a large number of projects. Such an approach does not necessarily 
address a number of issues that may also be of interest in the study of OSS communities, such as the 
motivation of developers, their individual activity, or their specific role/skills in the project. 

Our paper is organized as follows: Before we propose, in Sect. 4, a model to capture the dynamics of 
projects (and indirectly also of developers), in Sect. 2 we look at the macroscopic properties of the 
community, obtained by aggregating over all projects. In Sect. 3 we also analyze in detail the entry and 
exit rates of projects and developers and particularly focus on the size dependence of growth rates. The 
model we develop leads to a prediction for the size distribution of projects, which is compared with 
empirical data in Sect. 5. There, we also discuss reasons for deviations from the prediction and possible 
extensions of our work. 


2 Aggregated data analysis 

2.1 Dataset description 

The dataset used in our study was acquired from Sourceforge.net (SF), which was one of the world’s 
largest Open Source software development websites until Github.com became predominant. Our analyzed 
dataset contains 89 monthly snapshots from January 2003 to June 2012, in which information about 
all the developers and projects hosted on SF is recorded. In its early years, SF frequently changed 
the format in which information about the projects was stored. This leads to disruptions in our dataset 
because of corrupt data; in particular, snapshots between February 2003 and October 2004 and for January 
2005 are missing. Also, snapshots for July, August and September 2007 were removed by the SF archive 
provider because of data corruption [S]. Eventually, in February 2006, SF launched an Autopurge Service 
to remove inactive projects, which resulted in abnormal dropdowns in the number of projects. Starting 
from June 2010, SF automatically created a project for each developer. These “projects” do not represent 
real activities of the developers, so we removed them from the analysis. Also, in total 3 developers/projects 
were manually removed from the dataset. These three points had an extremely high number of links, were 
created by machines, and were used for the purpose of advertising or testing. 

Nevertheless, the remaining dataset is large and reliable enough for our analysis. Table 1 gives an overview 
of the total number of projects, $N_p(t)$, and developers, $N_d(t)$, in the first and the last month recorded 
in our dataset. We also have information about the relationship between projects and developers, in 
particular about the entry date (month) in which a developer joined a project. We then assume that a 
link between the developer and the project was created. The total number of links between developers 
and projects, $K(t)$, is also reported in Table 1. We note that, due to the lack of data, the programming 
language used is available only for about 40% of all projects. 

Table 1: Summary of available data for the first and the last monthly snapshot of the dataset. $N_d(t)$: 
total number of developers at time $t$; $N_p(t)$: total number of projects at time $t$; $K(t)$: total number of 
links between developers and projects at time $t$. 

Time       N_d(t)     N_p(t)     K(t)
01/2003    77.050     54.234     106.840
07/2012    339.140    357.555    576.238


2.2 Aggregated growth rates 

The simplest aggregated statistics are given by the total number of projects, $N_p(t)$, developers, $N_d(t)$, 
and links, $K(t)$, and how these numbers change over time measured in months. Figure 1 (left) shows their 
evolution. As we clearly observe in the log-linear plot, all of these quantities follow an exponential growth 
dynamics: 

$$\frac{dX}{dt} = \omega X\,; \qquad X(t) \propto \exp\{\omega t\} \tag{1}$$

This is also known as the law of proportionate growth and indicates that the SF community became more 
attractive the larger it was, which reamplified the growth for many years. The respective growth rates $\omega$ 
with $\Delta t = 1$ month are given in Table 2. 
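
The fit behind such growth rates can be sketched as a least-squares regression of $\ln X(t)$ on $t$, whose slope is $\omega$ in Eq. (1). The following is only a minimal illustration under stated assumptions: the function name `fit_growth_rate` is hypothetical, and the series is synthetic, not the SF data.

```python
import math

def fit_growth_rate(times, values):
    """Least-squares slope of ln(values) against times.

    For X(t) proportional to exp(omega * t), this slope estimates omega.
    """
    logs = [math.log(v) for v in values]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    cov = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
    var = sum((t - t_mean) ** 2 for t in times)
    return cov / var

# Synthetic monthly series (an assumption for illustration, NOT the SF data):
# exact exponential growth with omega = 0.0127 per month.
months = list(range(114))
series = [77050 * math.exp(0.0127 * t) for t in months]
omega = fit_growth_rate(months, series)
```

On real snapshots, missing months would simply be absent from `times`; the regression does not require equally spaced observations.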




Figure 1: (left) Total number of projects, $N_p(t)$ (yellow), total number of developers, $N_d(t)$ (blue), 
and total number of links, $K(t)$ (green), over time measured in months. The solid lines indicate fits 
of the growth rates given in Table 2. (right) Number of new single-developer projects (yellow) and 
multi-developer projects (blue) per month, over time. Solid lines represent the median obtained over 
a rolling window of one year. The EM estimate (green) is discussed in Sect. 5.2. The missing points in 
2006 correspond to the time periods when Autopurge was used excessively (see Sect. 2.1). 


Table 2: Regression results for Fig. 1 (left). 

Variable         Growth rate ω   R^2      p-value
K(t)             1.30%           > 0.99   2.80e-99
N_d(t)           1.27%           > 0.99   8.18e-99
N_p(t)           1.54%           > 0.99   8.06e-82
N_p(t < 2010)    1.33%           > 0.99   6.19e-55
N_p(t > 2010)    1.81%           > 0.99   2.30e-33


We note that the exponential growth remains despite the data disruptions explained in Section 2.1. 
A closer look at Fig. 1 and Table 2 reveals that, in the log-linear plot, both $N_d(t)$ and $K(t)$ grow at 
about the same growth rate, constant over time. For $N_p(t)$, however, we observe a significant change in 
the growth rate at about 2010. Before 2010, $N_p(t)$ grew at a rate comparable to the other aggregated 
quantities, but the growth significantly increased afterwards. Remarkably, this increase does not become 
visible in the growth of the total number of links. Hence, the network between developers and projects 
(to which the links refer) becomes sparser after 2010. 

We argue that the increasing growth rate for projects results from the fact that developers started 
their own single-developer projects. These could be either new developers entering SF or developers who 
already registered at SF for another project and now create another one. This conjecture is explored in 
Fig. 1 (right), which plots the number of new projects per month that have only one developer together 
with the respective number of new projects per month that have more than one developer. We observe a 
significant increase of single-developer projects at about 2010, while the number of new multi-developer 
projects per month remains about the same over time. 


2.3 Change in programming languages 

One of the reasons for the observed change towards more single-developer projects could be the 
rise of scripting languages for programming, such as PHP and, more recently, Python. Such programming 
languages have been widely adopted, in particular for single-developer projects, as we verify in our dataset. 
We already mentioned that only about 40% of all projects list their programming language, and some 
projects, especially large ones, also use more than one programming language. Precisely, in 01/2003 
information about the programming language was available for 35.089 projects, which increased to 187.168 
projects in 07/2012. There are in total 106 programming languages listed in the dataset, but more than 
80% of all projects use one of the major 7 languages C, C#, C++, Python, PHP, Java and JavaScript. 
Each of the remaining 99 languages has a share of less than 1 percent and is ignored in the following. 

The importance of the major 7 languages changed considerably over time, as Fig. 2 (left) shows. Despite 
the fact that this refers only to a subset of projects, we can observe that C lost nearly 10 percentage points 
of market share in 7 years (from 25% down to 15%), which is a loss of 40% of its original total market share 
against the other 6 languages, even if the absolute number of projects using C has increased. Java, on the 
other hand, increased its market share by about 10%. But the largest shares are taken by JavaScript and C#. 

Figure 2 (right) plots, for each of the 7 main programming languages, how the share of single-developer 
projects changes over time. We first note the trend towards more single-developer projects for all of 
these 7 languages, but with noticeable language preferences. In July 2012, 76% of all projects using C# 
are single-developer projects, followed by PHP with a share of 74% single-developer projects and Python and 
JavaScript with 72%. I.e., developers who prefer to work on their own have a clear preference for these 
languages. 

2.4 The bipartite network of developers and projects 

We now take a closer look at the developers and their projects. Both form a bipartite network, i.e., a 
network where links exist between different types of nodes. As explained above, we consider a link between 
a developer and a project if this developer has registered for the project regardless of her subsequent 
activity. Thus, instead of a weighted network where the weight of the links reflects the contribution, 
in this paper we only consider an unweighted network. A sketch of this bipartite network is shown in 
Fig. 3 (left), where 10 developers contribute to eight different projects. I.e., links between developers only exist 
through projects, and links between projects only through developers. 


Figure 2: (left) Share of the 7 most popular programming languages (normalized to 100%) over time 
measured in months, for all projects with available information about programming languages. (right) 
Share of single-developer projects (normalized for each of the 7 most popular programming languages 
separately) over time measured in months. Legend: C#, PHP, Python, JavaScript, C++, C, Java. 



Figure 3: (left) Example of a bipartite network where 10 developers labeled H1, ..., H10 contribute 
to eight projects labeled P1, ..., P8. (middle) Projection of the network of developers (linked 
by common projects). (right) Projection of the network of projects (linked by common developers). 


Nevertheless, we can draw two projections of this bipartite network, also shown in Fig. 3, one with respect 
to the developers and one with respect to the projects. In these projections, a link between developers 
appears if both of them contribute to the same project, and a link between projects appears if both of 
them have the same developer contributing. We emphasize that the bipartite network and its projections 
are aggregated over a given time interval, i.e., a link essentially reflects that two developers contributed 
to the same project in the same time interval (and not necessarily at the same time). 

Based on the aggregated description, we can define the degree $k_i$ of a developer $i$ as the number of links 
she has, i.e., the total number of projects she was involved in over that time period. Likewise, we can also 
define the degree $x_r$ of a project $r$ as the total number of developers that contributed to this project 
over that time period. $x$ is also called the size of the project, as measured by the number of developers. 
The network of developers can then be described by a degree distribution $f(k)$ which gives the fraction of 
developers with degree $k$ in the population of all developers, during the observation period. Likewise the 
degree distribution, later also called size distribution, $f(x)$ gives the fraction of projects with $x$ developers, 
during the observation period. Both distributions are plotted in Fig. 4 for the snapshot of 06/2012, which 
is the last snapshot of our dataset. 
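
The degrees and the two projections defined above can be computed directly from a list of developer-project links. The sketch below is only an illustration: the function name `degrees_and_projections` and the four example links are hypothetical and are not taken from the SF dataset.

```python
from collections import defaultdict
from itertools import combinations

def degrees_and_projections(links):
    """Compute degrees and projections of an unweighted bipartite network.

    links: iterable of (developer, project) pairs.
    Returns (k, x, dev_links, proj_links), where k maps each developer to
    her number of projects, x maps each project to its size, and the two
    sets hold the links of the developer and project projections.
    """
    devs_of = defaultdict(set)   # project -> developers contributing to it
    projs_of = defaultdict(set)  # developer -> projects she contributes to
    for dev, proj in links:
        devs_of[proj].add(dev)
        projs_of[dev].add(proj)
    k = {d: len(ps) for d, ps in projs_of.items()}
    x = {p: len(ds) for p, ds in devs_of.items()}
    # Two developers are linked if they share at least one project.
    dev_links = {frozenset(pair)
                 for ds in devs_of.values()
                 for pair in combinations(sorted(ds), 2)}
    # Two projects are linked if they share at least one developer.
    proj_links = {frozenset(pair)
                  for ps in projs_of.values()
                  for pair in combinations(sorted(ps), 2)}
    return k, x, dev_links, proj_links

# Hypothetical toy network: three developers, two projects.
links = [("H1", "P1"), ("H2", "P1"), ("H2", "P2"), ("H3", "P2")]
k, x, dev_links, proj_links = degrees_and_projections(links)
```

The distributions $f(k)$ and $f(x)$ then follow by counting how often each value occurs in `k` and `x` and dividing by the number of developers or projects, respectively.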



Figure 4: (left) Degree (size) distribution of projects (i.e., number of developers per project), $f(x)$, 
over size $x$. (right) Degree distribution of developers (i.e., number of projects a developer contributes to), 
$f(k)$, over degree $k$, for the monthly snapshot of June 2012. 


We observe that both are very skew distributions, reflecting the fact that there is a considerable 
probability to find projects of large sizes, or developers involved in very many projects. The distributions 
resemble known scale-free distributions (such as power-law distributions), which indicates that there is 
no characteristic scale (size, number of projects) for projects or developers. In fact, these are not pure 
power-law distributions (note the bend in the shape and a rather limited range), but the specific type 
will be discussed in Sect. 5. 

3 The growth of OSS projects: A microscopic perspective 

3.1 Entry and exit dynamics 

In this section, we analyze the dynamics of projects and of the developer community in more detail, 
by looking at the available data about birth and death of projects and entry and exit of developers, 
instead of the aggregated growth. Figure 5 (left) shows the number of new projects per month, as well 
as the number of removed projects per month, while Fig. 5 (right) shows the corresponding numbers for 
developers. We call the underlying processes “entry” and “exit” of projects or developers, respectively. 



Figure 5: (left) Number of new projects (yellow) and removed projects (grey) per month, (right) 
Number of new developers (blue) and removed developers (grey) per month. 


We can immediately observe that the number of entry events largely exceeds the number of exit events, 
for each month, both for projects and developers, with the exception of the large exit spikes observed in 
2006-2007 because of the project clean-up initiated by SF (see Sect. 2.1). The reason for the dominance 
of the entry dynamics in normal time periods is that most projects or developers are not really removed 
if they become inactive. In fact, it is not trivial to determine whether a project or a developer is really 
inactive. Often, the activity just decreases considerably, but does not cease to exist. Also, the fact that 
there is no activity in a given time period does not imply that there will be also no activity in the future. 
We have discussed this question in detail elsewhere. In this paper, we do not speculate about inactivity and 
just take the computed exit rates as a matter of fact. For the modeling in Section 4, we take advantage 
of their very low numbers and will simply neglect the exit dynamics. 

Finally, we also note the occasional large fluctuations in the exit rates, both for projects and developers. 
These are the results of extraordinary efforts by SF to clean up the project and developer base, e.g., 
by testing and implementing the new Autopurge System after turning off the old one (see Section 2.1). 
During and shortly after this switch, either extremely high or extremely low numbers of projects and 
developers were detected as inactive and removed. 


The second important observation is the growth of the monthly entry rates over time, both for projects and 
developers (indeed, for projects, we could also note an increase of the exit rates over time). High occasional 
fluctuations might result from seasonal factors (holidays) or high media attention. This growth on average 
is in line with the exponential growth observed both for projects and developers on the aggregated level as 
discussed in Section 2.2. The law of proportionate growth tells us that SF, for the observed time interval, 
became more attractive the bigger it was. Hence, the monthly entry rates should depend on the current 
numbers of projects or developers, respectively. Figure 6 plots these relative monthly entry rates 

$$g_p(t) = \frac{N_p(t) - N_p(t - \Delta t)}{N_p(t)}\,; \qquad g_d(t) = \frac{N_d(t) - N_d(t - \Delta t)}{N_d(t)} \tag{2}$$

both for projects and developers. We note that, despite some considerable fluctuations, they tend to 
vary around long-term stationary values $g_p$, $g_d$ in a first order approximation (i.e., we do not discuss a 
non-linear dependency on $N$). 
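
As a minimal sketch of Eq. (2), the relative entry rates can be computed from a series of consecutive monthly totals. The function name `relative_entry_rates` and the four counts are hypothetical; the sketch assumes that consecutive list entries are exactly one month apart, which the real dataset with its missing snapshots does not always satisfy.

```python
import statistics

def relative_entry_rates(counts):
    """g(t) = (N(t) - N(t - 1)) / N(t) for consecutive monthly totals N."""
    return [(cur - prev) / cur for prev, cur in zip(counts, counts[1:])]

# Hypothetical monthly totals (illustration only, NOT the SF data).
counts = [1000, 1020, 1045, 1060]
rates = relative_entry_rates(counts)
median_rate = statistics.median(rates)
```

The median is used here because it is the summary statistic reported for the time series of entry rates; it is less sensitive to occasional spikes than the mean.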


Figure 6: Relative monthly entry rate of projects, $g_p(t)$ (left), and of developers, $g_d(t)$ (right), over 
time (2005 to 2012). The horizontal lines represent the median of all the values in the time series (solid line) 
and the 10% and 90% quantiles (dashed lines). The values read for $g_p$: 10%: 0.0131, median: 0.0196, 
90%: 0.0284, and for $g_d$: 10%: 0.0118, median: 0.0164, 90%: 0.0232. 


3.2 Size dependent growth rates of projects 

So far, we have discussed the law of proportionate growth on the aggregated level, both with respect to 
the absolute numbers $N_d$, $N_p$, and $K$, Eq. (1), and the relative growth rates, $g_d$, $g_p$, Eq. (2). But we can 
also refer to the individual project level, to verify this dynamics. 

Recall that the size $x_r$ of a project $r$ is defined by the number of developers contributing to it ($x_r$ was 
also called the degree of the project because, in the bipartite network, links exist between developers and 
the project). Then, the growth dynamics on the individual project level reads as: 

$$\frac{x_r(t + \Delta t) - x_r(t)}{\Delta t} \propto x_r^{\gamma}(t) \tag{3}$$

If $\gamma = 1$, we have a growth strictly proportional to size, which is also known as preferential attachment 
in network theory, i.e., nodes (projects) receive new links (developers) proportional to the number of 
existing links. $\gamma > 1$ would indicate a super-linear growth. 

As we have seen on the aggregated level, growth rates heavily fluctuate for each month. Therefore, for 
the individual project level, we choose the time window $\Delta t = 12$ months, i.e., large enough to cancel out 
these fluctuations. We then compute the average growth rate $g(x)$ of projects with similar size $x$ for each 
year, which is shown in Fig. 7 (left). We verify that the annual growth rate indeed increases with the size 
of the project, as described in Eq. (3), and we barely notice differences in this dependence for different 
years. 


Figure 7: (left) Averaged annual growth rate $g(x)$ over project size $x$ (measured by the number 
of developers) for each year separately. (right) Exponent $\gamma$, Eq. (3), for each year (2005 to 2011), where 
error bars indicate the standard error. 


For a closer inspection of the law of proportionate growth, we estimate the exponent $\gamma$, Eq. (3), from 
the data separately for each year. We find that $\gamma$ indeed varies between 1.23 and 1.35 during the seven-year 
period, hence the growth is slightly super-linear for all times. However, because $\gamma$ is very close to 1, 
we can still argue that the law of proportionate growth approximately holds, but there is a higher order 
dependence on the project size which may enter the proportionality factor $\beta$, i.e., $\dot{x} = \beta(x)\, x$. 
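
One simple way to estimate such an exponent is a least-squares fit on log-log scale: if $g(x) \propto x^{\gamma}$, then $\ln g$ is linear in $\ln x$ with slope $\gamma$. The sketch below illustrates this under stated assumptions; it is not necessarily the estimation procedure used for Fig. 7, the function name `fit_exponent` is hypothetical, and the data are synthetic.

```python
import math

def fit_exponent(sizes, growth):
    """Least-squares slope of ln(growth) against ln(size).

    For growth proportional to size**gamma, the slope estimates gamma.
    All sizes and growth values are assumed to be strictly positive.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(g) for g in growth]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((a - x_mean) * (b - y_mean) for a, b in zip(xs, ys))
    var = sum((a - x_mean) ** 2 for a in xs)
    return cov / var

# Synthetic averaged growth per size class (illustration only):
# exact power law with gamma = 1.3.
sizes = [1, 2, 4, 8, 16, 32]
growth = [0.05 * s ** 1.3 for s in sizes]
gamma = fit_exponent(sizes, growth)
```

With empirical data, size classes with zero average growth would have to be excluded or binned before taking logarithms.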


4 A dynamic model of project growth 

4.1 Dynamic assumptions 

In the following, we focus on the dynamics at the project level only. The number of projects of a given 
size $x$ (measured by the number of developers contributing to the project) at a given time $t$ is given by 
$n(x,t)$. Each project of size $x$ at time $t$ belongs to the same size class $Y_t = x$. Time $t$ is assumed to be 
discrete (measured in months). The total number of projects, $N_p(t)$, and the total number of developers, 
$N_d(t)$, contributing to projects are defined as: 

$$N_p(t) = \sum_{x=1}^{x_{\max}} n(x,t)\,; \qquad N_d(t) = \sum_{x=1}^{x_{\max}} x\, n(x,t) \tag{4}$$

For our dynamic assumptions, we follow the model of Simon [12] for the entry of new firms and the growth 
of existing firms. That means in our model the entry of new developers is assumed to be the only source 
for (i) establishing new projects and (ii) enlarging existing ones. Precisely, we neglect the possibility that 
established developers already involved in other projects also found a new project. This is supported by 
the empirical finding that most developers are only involved in one project, as the degree distribution of 
developers in Fig. 4 (right) shows. But we will come back to the validity of this assumption later. 

We further neglect fragmentation, or fork processes, i.e., an existing larger project splits up into two (or 
more) smaller projects, which could also be seen as the foundation of a new project of size $x > 1$. We 
argue that such events exist but are comparably rare, so that we cannot sufficiently calibrate our model 
against such data, and simply neglect these events. 

The empirical finding of Fig. 1 tells us that the number of developers has increased exponentially. We can 
include this in three different ways: (a) choosing a linearly increasing number of new entrants per time 
interval, (b) rescaling the time interval linearly such that the number of new entrants per time interval is 
constant, (c) replacing time $t$ simply by the total number of developers $N_d$. We have chosen the latter as 
the most elegant way. Hence from $N_d(t) \propto \exp\{\omega t\}$, we find the transformation $t \to \ln N_d / \omega$. From now 
on $N_d = N$ measures time in discrete steps, $N, N + 1, \ldots$ 

For the change of $n(x,t)$, we discuss the following processes: 

1. A new project is founded: Here the assumption is that the project starts in the smallest size class 
$x = 1$. There is a certain (conditional) probability 

$$P_{N+1}(Y_{N+1} = 1 \mid Y_N = 0) = p_0 \qquad (x = 1) \tag{5}$$

that we find in the next time step $N + 1$ a new project of size 1 (where its former size 0 indicates 
that the project did not exist yet at time $N$). This probability is denoted as $p_0 \in (0,1)$ and assumed 
to be constant in time, except for the very first time step $N = 0$ at which no projects exist yet. So 
one has to be founded with certainty: 

$$P_1(Y_1 = 1 \mid Y_0 = 0) = 1 \tag{6}$$

i.e., we start the dynamic process with one project that is of the smallest possible size 1. Because 
at each time step only one new developer enters, the largest possible size of any project cannot be 
larger than $N$, i.e., we set $x_{\max} = N$. 

2. An established project grows: Here the assumption is that the project only grows by attracting one 
new developer at a time. This event is described by the probability 

$$P_{N+1}(Y_{N+1} = x \mid Y_N = x - 1) = K(N)\,(x - 1)^{\alpha}\, n(x - 1, N) \qquad (x = 2, \ldots, N) \tag{7}$$

I.e., the (conditional) probability of a project in size class $(x - 1)$ to grow at time $N$ is proportional 
to the number of projects in that size class, $n(x - 1, N)$. However, the new developer may have 
a preference for larger or smaller projects, i.e., the probability to choose from size class $(x - 1)$ 
is also proportional to $(x - 1)^{\alpha}$. $\alpha = 0$ would recover the case of no size preference, which has 
also been discussed in the literature. $\alpha = 1$ would be a preference directly proportional to the existing size, which 
has been used to cope with Gibrat’s law of proportionate growth. In the following, because 
of analytical tractability we will only consider the case $\alpha = 1$. Its empirical evidence is discussed in 
the next section, while consequences for different values of $\alpha$ are discussed later. 

$K(N)$ is a proportionality constant that has to satisfy the condition that all probabilities sum up 
to 1. Using $\alpha = 1$ from now on, we find: 

$$\sum_{x=1}^{N} K(N)\, x\, n(x,N) + p_0 = 1 \tag{8}$$

Because of Eq. (4), $\sum_{x=1}^{N} x\, n(x,N) = N$, and we get from Eq. (8) 

$$K(N)\, N + p_0 = 1\,; \qquad K(N) = \frac{1 - p_0}{N} \tag{9}$$

Note that this would not hold if $\alpha \neq 1$. Equation (9) allows us to rewrite Eq. (7) as 

$$P_{N+1}(Y_{N+1} = x \mid Y_N = x - 1) = (1 - p_0)\, \frac{(x - 1)\, n(x - 1, N)}{N}\,, \qquad x = 2, \ldots, N \tag{10}$$


Our kinetic assumptions, as seen from the perspective of the developer, are summarized as follows: at 
each time step $N + 1$, one new developer arrives. This developer has two options: (i) with probability $p_0$ 
she chooses to found a new project, (ii) with probability $(1 - p_0)$, she chooses to join one of the $N_p(N)$ 
projects that exist at time $N$. Without any preference for larger projects, she will choose a project 
from size class $x$ with a probability $(1 - p_0)\, n(x, N)/N$. But with the assumed size preference, $\alpha = 1$, the 
proportional weight $x$ comes into account. 
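
These kinetic assumptions can be sketched as a small stochastic simulation. The code below is only an illustration with $\alpha = 1$; the function name `simulate_simon`, the seed, and the parameter values are hypothetical choices. It uses the fact that, if every developer belongs to exactly one project, joining the project of a uniformly chosen existing developer selects a project with probability proportional to its size.

```python
import random
from collections import Counter

def simulate_simon(n_developers, p0, seed=0):
    """Simulate the entry of n_developers developers, one per time step.

    Returns a Counter mapping project size x to the number of projects
    n(x, N) of that size after N = n_developers steps.
    """
    rng = random.Random(seed)
    project_of = [0]   # the first developer founds project 0 with certainty
    n_projects = 1
    for _ in range(1, n_developers):
        if rng.random() < p0:
            # (i) found a new project of size 1
            project_of.append(n_projects)
            n_projects += 1
        else:
            # (ii) join an existing project, chosen proportionally to size:
            # pick a random earlier developer and join her project
            project_of.append(project_of[rng.randrange(len(project_of))])
    sizes = Counter(project_of)      # project id -> size x
    return Counter(sizes.values())   # size x -> number of projects

# Hypothetical parameters, chosen for illustration only.
n_of_x = simulate_simon(20000, 0.3)
total_developers = sum(x * c for x, c in n_of_x.items())
total_projects = sum(n_of_x.values())
```

By construction `total_developers` equals the number of simulated steps, and `total_projects` fluctuates around $1 + p_0 (N - 1)$.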

We emphasize that some of the dynamic processes one could think of are deliberately neglected. For example,
we neglect that several developers may join one or different projects during the same time interval (which
can be solved by changing the time resolution). More importantly, we also neglect that developers switch
between projects, i.e., some projects lose and some projects gain developers while the total number of
developers does not change. Such dynamics can be seen as reallocation processes among projects.

Further, we do not consider that existing projects shrink, i.e., lose in size if developers leave. Since we
have excluded reallocation processes, this would mean that developers become inactive. Again, there is
empirical evidence for this process (see Fig. ??). But the number of developers exiting is (a) rather constant
in time, and (b) much smaller than the number of new developers arriving. Therefore, we will consider
this in our model as a rescaling of the arrival rate of new developers and will not explicitly model the
shrinking process.

Finally, we do not consider that an established project ceases to exist. This can
happen in two ways: (a) the project goes extinct, and (b) two existing projects merge into a new one
with a larger project size. Again, in our model we neglect both processes. Projects often become inactive
but are rarely deleted, and for mergers and acquisitions the same argument as for project forks applies.

With these considerations the total number of projects at time $N$ is given by:

$$\sum_{x=1}^{N} n(x,N) \approx 1 + p_0 (N-1) \approx N p_0 \tag{11}$$

The first term, 1, results from the fact that there exists a new project in the first time step. Then, during
every time step from 2 up to $N$, a new project appears with probability $p_0$. Hence, if $p_0 \ll 1$ and $N$
is large, this gives approximately $1 + p_0 (N - 1)$, which is approximately $N p_0$.

4.2 The rate equation

We now start formalizing the above assumptions by developing a rate equation for the relevant quantity
$n(x,N)$. This can change by two processes: (a) gain: a project of size $x - 1$ is chosen by the developer
and thus advances to the next size class, leading to an increase in $n(x,N)$; (b) loss: a project of size $x$ is
chosen by the developer and thus advances to the next size class, leading to a decrease in $n(x,N)$.


$$n(x,N+1) - n(x,N) = (1 - p_0) \left[ \frac{(x-1)\, n(x-1,N)}{N} - \frac{x\, n(x,N)}{N} \right], \qquad x = 2, \ldots, N+1 \tag{12}$$


The LHS of Eq. (12) describes the net inflow of projects into size class $x$. Similarly, we get for the number
$n(1,N)$ of new projects at time $N + 1$:

$$n(1,N+1) - n(1,N) = p_0 - (1 - p_0)\, \frac{n(1,N)}{N} \tag{13}$$


The gain comes from founding new projects with probability $p_0$, whereas the loss results from the fact
that a project of size 1 grows into size 2.
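
Equations (12) and (13) can also be iterated numerically, which gives a quick check of how the expected class sizes evolve. The sketch below is only an illustration: the function name, the cut-off `x_max`, and the value of $p_0$ are assumptions made here. Inserting the ansatz $n(1,N) = aN$ into Eq. (13) gives $a = p_0/(2 - p_0)$, so the ratio $n(1,N)/N$ should approach that constant.

```python
def iterate_rate_equations(n_steps, p0, x_max=50):
    """Iterate Eqs. (12) and (13) from N = 1 up to N = n_steps.

    n[x] holds the expected number of projects of size x (index 0 unused).
    Size classes above x_max are dropped, which does not affect the
    classes below because projects only move upward in size.
    """
    n = [0.0] * (x_max + 1)
    n[1] = 1.0  # one single-developer project at N = 1
    for N in range(1, n_steps):
        new = n[:]
        # Eq. (13): gain from founding, loss from growing to size 2
        new[1] = n[1] + p0 - (1 - p0) * n[1] / N
        # Eq. (12): gain from class x - 1, loss from class x
        for x in range(2, x_max + 1):
            gain = (x - 1) * n[x - 1] / N
            loss = x * n[x] / N
            new[x] = n[x] + (1 - p0) * (gain - loss)
        n = new
    return n

if __name__ == "__main__":
    p0 = 2 / 3
    for steps in (10, 100, 1000, 5000):
        n = iterate_rate_equations(steps, p0)
        print(steps, n[1] / steps, "limit:", p0 / (2 - p0))
```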

In the following we will only consider “steady-state” distributions, i.e., we assume that each size class 
grows proportionally with N: 


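$$\frac{n(x,N+1)}{n(x,N)} = \frac{N+1}{N} \qquad \forall\, x, N \ \ (\text{if } x < N) \tag{14}$$

With $\Delta n(x,N) := n(x,N+1) - n(x,N)$, we rewrite Eqs. (12), (13) as:

$$\begin{aligned}
\Delta n(x,N) &= (1 - p_0) \left[ \frac{(x-1)\, n(x-1,N)}{N} - \frac{x\, n(x,N)}{N} \right], \qquad x = 2, \ldots, N+1 \\
\Delta n(1,N) &= p_0 - (1 - p_0)\, \frac{n(1,N)}{N}
\end{aligned}$$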


From Eq. (14) it follows that: 


$$n(x,N+1) = \left(1 + \frac{1}{N}\right) n(x,N) \tag{15}$$

$$\Delta n(x,N) = \frac{n(x,N)}{N} \tag{16}$$


Plugging Eq. (16) in Eqs. (12), (13), we have: 


$$\begin{aligned}
\Delta n(x,N) &= \frac{n(x,N)}{N} = (1 - p_0) \left[ \frac{(x-1)\, n(x-1,N)}{N} - \frac{x\, n(x,N)}{N} \right], \qquad x = 2, \ldots, N \\
\Delta n(1,N) &= \frac{n(1,N)}{N} = p_0 - (1 - p_0)\, \frac{n(1,N)}{N}
\end{aligned} \tag{17}$$

which simplifies to the following set of equations:

$$\begin{aligned}
0 &= (1 - p_0)(x-1)\, n(x-1,N) - (1 - p_0)\, x\, n(x,N) - n(x,N) \\
0 &= N p_0 - (1 - p_0)\, n(1,N) - n(1,N)
\end{aligned} \tag{18}$$

4.3 The size distribution of projects


In order to solve Eq. (18), we define the new parameter $\rho$:

$$\rho = \frac{1}{1 - p_0} \tag{19}$$

where $1 < \rho < \infty$, because of $p_0 \in (0,1)$. To interpret $\rho$ [12], we keep in mind that $p_0$ actually decides how
much of the growth (one developer per time unit) is spent on new projects as compared to established
projects. In Sect. 5.1 we will test this relation against our empirical data.


From Eq. (18), we find for the stationary solution for $n(1,N)$ (denoted by $*$):

$$n^*(1,N) = \frac{N p_0}{2 - p_0} = N p_0\, \frac{\rho}{\rho + 1} \tag{20}$$

whereas we find for the stationary solution of $n(x,N)$:

$$n^*(x,N) = \frac{(1 - p_0)(x-1)}{1 + (1 - p_0)\, x}\, n^*(x-1,N) = \frac{x-1}{\rho + x}\, n^*(x-1,N) \tag{21}$$

We can solve this equation in an iterative manner, to find:

$$n^*(x,N) = \frac{(x-1)(x-2) \cdots 1}{(\rho + x)(\rho + (x-1)) \cdots (\rho + 2)}\, n^*(1,N), \qquad x = 2, \ldots, N \tag{22}$$

To further compact this expression, we make use of the so-called Gamma function $\Gamma(z)$ with the property
$\Gamma(z+1) = z\,\Gamma(z)$ that, for integer $z$, results in:

$$\Gamma(z) = (z-1)! \tag{23}$$

The denominator of Eq. (22) can then be expressed as:

$$\Gamma(x + \rho + 1) = (x + \rho)\,\Gamma(x + \rho) = (x + \rho)(x + \rho - 1) \cdots (2 + \rho)\,\Gamma(\rho + 2) \tag{24}$$

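With this and the expression for $n^*(1,N)$, Eq. (20), we can rewrite Eq. (22):

$$\begin{aligned}
n^*(x,N) &= \frac{\Gamma(x)\,\Gamma(\rho+2)}{\Gamma(x+\rho+1)}\, n^*(1,N) \\
&= (\rho+1)\, \frac{\Gamma(x)\,\Gamma(\rho+1)}{\Gamma(x+\rho+1)}\, n^*(1,N) = \rho\, B(x, \rho+1)\, N p_0
\end{aligned} \tag{25}$$

where $\Gamma(x)\,\Gamma(\rho+1)/\Gamma(x+\rho+1) = B(x, \rho+1)$ is the Beta function.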

The corrected, normalized Yule-Simon distribution, which holds also for $x = 1$, is then:

$$f(x,N) = \frac{n^*(x,N)}{N p_0} = \rho\, B(x, \rho+1) \tag{26}$$

For large $x$ we get for $f(x,N)$:

$$f(x,N) \approx \frac{\rho\,\Gamma(\rho+1)}{(x+\rho)^{\rho+1}} \propto x^{-(\rho+1)} \tag{27}$$

which has the form of a power law if $x$ is large enough. Therefore, the power law approximates the Yule-
Simon distribution only in its upper tail. To get Zipf’s law, one has to assume $p_0 \to 0$, as then $\rho = 1/(1 - p_0) \approx 1$.
However, as noticed, e.g., by [7], for $p_0 \to 0$ the convergence to the steady state is infinitely slow.
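
The properties of Eq. (26) derived above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names are chosen here for illustration. It evaluates $\rho B(x, \rho+1)$ via the log-Gamma function and compares it with the recursion of Eq. (21), the normalization, and the power-law tail.

```python
import math

def yule_simon_pmf(x, rho):
    """f(x) = rho * B(x, rho + 1), evaluated via log-Gamma for stability."""
    log_beta = math.lgamma(x) + math.lgamma(rho + 1) - math.lgamma(x + rho + 1)
    return rho * math.exp(log_beta)

def rho_from_p0(p0):
    """Eq. (19): rho = 1 / (1 - p0)."""
    return 1.0 / (1.0 - p0)

if __name__ == "__main__":
    rho = rho_from_p0(2 / 3)  # approximately 3
    # Recursion of Eq. (21): f(x) / f(x - 1) = (x - 1) / (rho + x)
    for x in (2, 5, 50):
        ratio = yule_simon_pmf(x, rho) / yule_simon_pmf(x - 1, rho)
        print(x, ratio, (x - 1) / (rho + x))
    # Normalization: partial sums approach 1
    print(sum(yule_simon_pmf(x, rho) for x in range(1, 100_000)))
    # Tail: f(x) * x**(rho + 1) approaches rho * Gamma(rho + 1)
    x = 10_000
    print(yule_simon_pmf(x, rho) * x ** (rho + 1), rho * math.gamma(rho + 1))
```

For $\rho = 3$ the share of single-developer projects is $f(1) = \rho/(\rho+1) = 0.75$, which illustrates how strongly the head of the distribution dominates.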

5 Discussion


5.1 Comparison with the Yule-Simon distribution 


We now have a theoretical prediction for the size distribution of projects, Eq. (26), and we have the
respective empirical data for different years. Therefore, as a first step, we evaluate the kind of distribution
that was already plotted in Fig. ??.


Before doing so, we have to argue whether the theoretical prediction and the empirical data really describe 
the same kind of projects. Our theoretical model is based on the assumption that all projects entering 
the system are potentially available to grow in the number of developers, i.e., developers can simply join 
them. This, however, cannot be confirmed for all single-developer projects listed in the database. Here, we 
have to consider that developers host their projects on SF not just to invite collaboration, but for various 
reasons, e.g. for archival purposes or just for distribution. While we cannot access the intrinsic reasons for 
a project to be hosted on SF, we argue that all new projects appearing on SF every month can be divided 
into two classes: (a) collaborative projects, i.e., projects that are meant to grow also by the contribution of 
other developers joining the project, and (b) non-collaborative projects, that are not aimed at attracting 
other developers and thus do not grow in size as measured by the number of developers, but maybe grow 
in their lines of code submitted by the project holder. I.e., we conjecture that there is a sizable number 
of single-developer projects that, from their very beginning, are not captured by our model that applies 
only to collaborative projects, i.e., projects with the potential to grow in the number of developers. 


Consequently, when comparing the predicted size distribution with empirical data, we have to take into
account that $f(1,N)$, i.e., the normalized density of projects of size 1, will need to be corrected, to subtract
the non-collaborative projects, $f^{n}(1,N)$ (where $n$ stands for non-collaborative), and to consider only the
collaborative ones, $f^{c}(1,N)$. The procedure for this necessary correction will be described further below.
The resulting corrected size distribution will then be indicated by $f^{c}(x,N)$.

As the null hypothesis for the size distribution we test for the Yule-Simon distribution, for
which the maximum likelihood estimate of the parameter $\rho$ can be computed numerically [4]. We perform a
Kolmogorov-Smirnov (KS) test to determine the significance level ($p$-value) for which the empirical
distribution matches the Yule-Simon distribution. A value $p = 0$ means that the two distributions do not match
under any circumstances. The higher the $p$-value, the more likely it is that the null hypothesis cannot be
rejected. That means we cannot exclude that the Yule-Simon distribution is the right distribution, but
there might be also other candidate distributions that could be considered (which we abstain from).

Applying the KS test in its simplest form to skew distributions usually results in very high $p$-values,
simply because the mass of the distribution is mostly concentrated in the head while the tail is weighted
less. The problems resulting from this naive approach have been discussed in detail in [1]. These authors
also propose a more reliable, but computationally more demanding, goodness-of-fit test suitable for
heavy-tailed distributions, which we adopt here.
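
To make the procedure concrete, the sketch below combines a numerical maximum-likelihood fit of $\rho$ with a KS distance and a parametric bootstrap for the $p$-value. It is a simplified illustration in the spirit of the test described above, not the implementation used for the results: all function names, the bisection bounds, and the number of bootstrap replicates are assumptions made here, and no lower cut-off of the distribution is estimated.

```python
import math
import random
from collections import Counter

def pmf(x, rho):
    """Yule-Simon probability rho * B(x, rho + 1)."""
    return rho * math.exp(math.lgamma(x) + math.lgamma(rho + 1)
                          - math.lgamma(x + rho + 1))

def sample(rho, rng):
    """One Yule-Simon variate: W ~ Exp(rho), then X ~ Geometric(exp(-W))."""
    p = max(math.exp(-rng.expovariate(rho)), 1e-300)
    if p >= 1.0:
        return 1
    u = 1.0 - rng.random()                      # uniform on (0, 1]
    return int(math.log(u) / math.log1p(-p)) + 1

def fit_rho(counts):
    """Maximum-likelihood rho for {size: count}, by bisection.

    rho * dlogL/drho = n - sum_x c_x * sum_{j<=x} rho / (rho + j) decreases
    strictly in rho, so its root is the unique maximum.
    """
    n, x_max = sum(counts.values()), max(counts)
    def score(rho):
        total, partial = n, 0.0
        for x in range(1, x_max + 1):
            partial += rho / (rho + x)
            total -= counts.get(x, 0) * partial
        return total
    lo, hi = 1e-6, 1e3                          # assumed search interval
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if score(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def ks_distance(counts, rho):
    """Largest gap between the empirical and the fitted CDF."""
    n = sum(counts.values())
    emp = model = worst = 0.0
    for x in range(1, max(counts) + 1):
        emp += counts.get(x, 0) / n
        model += pmf(x, rho)
        worst = max(worst, abs(emp - model))
    return worst

def bootstrap_p_value(sizes, n_boot=50, seed=0):
    """Share of synthetic samples whose KS distance is at least the observed one."""
    rng = random.Random(seed)
    counts = Counter(sizes)
    rho = fit_rho(counts)
    d_obs = ks_distance(counts, rho)
    exceed = 0
    for _ in range(n_boot):
        synth = Counter(sample(rho, rng) for _ in range(len(sizes)))
        if ks_distance(synth, fit_rho(synth)) >= d_obs:
            exceed += 1
    return rho, d_obs, exceed / n_boot

if __name__ == "__main__":
    rng = random.Random(1)
    data = [sample(3.0, rng) for _ in range(2000)]   # synthetic, not SF data
    print(bootstrap_p_value(data))
```

Refitting $\rho$ on every synthetic sample is what makes this test more demanding than a single KS comparison, since each replicate is judged against its own best fit.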

We first turn to the degree distribution of developers, an example of which is shown in Fig. ?? (right). Our
goodness-of-fit test reveals that the rather steep, “broad” developer degree distribution, $f(k)$, does not
follow a Yule-Simon distribution ($p = 0$). But since we never made a hypothesis about this and did not
develop a model for it, we just take this as a fact.

Figure 8: Project size distribution $f(x)$ for different monthly snapshots. For each snapshot, a fit
(dashed gray line) of the Yule-Simon distribution is plotted, for which the parameter $\rho_{\mathrm{all}}$ was numerically
obtained. The goodness-of-fit test, however, rejects the hypothesis that the Yule-Simon distribution
fits the empirical one for most of the snapshots. A second fit (solid gray line) of the Yule-Simon
distribution, for which the value for single-developer projects is taken as an unknown, latent variable, is also
plotted; for this fit the same hypothesis cannot be rejected for most of the snapshots.


With respect to the project size distribution, $f(x)$, we have to test the null hypothesis of the Yule-Simon
distribution for each monthly snapshot, an example of which is shown in Fig. ?? (left). Simply fitting the
Yule-Simon distribution to the empirical one and calculating the parameter $\rho_{\mathrm{all}} = 3.88$ according to [1]
would lead to the result shown in Fig. 8 (lower right). The (dashed-line) fit is visually worse; it does not
capture the tail well because the (uncorrected) value of $f(1,N)$ is over-represented.

Figure 9: $p$-values for the goodness-of-fit test [1] of the Yule-Simon distribution and the empirical
size distribution for each monthly snapshot. Collaborative projects refers to the project size distribution
with corrected value for the single-developer projects. We observe that the Yule-Simon distribution is a
plausible candidate for the size distribution of all projects only for certain periods of time, while for the
corrected empirical distribution it is a plausible candidate for most of the time, notably from late
2005 to late 2009.

Therefore, in the next step we correct $f(1,N)$ as follows: Given the value $\rho_{\mathrm{all}}$ obtained from all projects,
we first predict the value $f^{c}(1,N)$ for the collaborative single-developer projects. Then, we do a new fit
of the Yule-Simon distribution, which leads to a corrected value $\rho_{\mathrm{col}}$. With this new value, we do a better
prediction of $f^{c}(1,N)$, and so forth. This method is known as the expectation-maximization (EM)
algorithm [2], where the number of single-developer projects $f^{c}(1,N)$ is used as an unknown, latent variable.
EM is an iterative algorithm that consists of alternating expectation (E) steps and maximization (M)
steps. Expectation refers to predicting $f^{c}(1,N)$, while maximization refers to calculating the appropriate
$\rho_{\mathrm{col}}$. We halt the algorithm when the change of $\rho_{\mathrm{col}}$ is smaller than a given small threshold $\epsilon$. This leads
to the much better (solid gray line) fit shown in Fig. 8 (lower right). Taking all corrected distributions
of Fig. 8 into account, we observe that the parameter $\rho_{\mathrm{col}}$ stays almost constant over time with values
around $\rho_{\mathrm{col}} \approx 3$ (which contrasts with the uncorrected distributions where $\rho$ increases).
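
One possible reading of this iterative correction is sketched below. It is an assumption-laden illustration rather than the procedure actually used: the counts for sizes $x \geq 2$ are taken as reliable, the E step uses the Yule-Simon identity $f(1)/(1 - f(1)) = \rho$ to predict the number of collaborative single-developer projects, and the M step refits $\rho$ by maximum likelihood. The function names, the stopping threshold `eps`, and the synthetic counts are hypothetical.

```python
import math

def pmf(x, rho):
    """Yule-Simon probability rho * B(x, rho + 1)."""
    return rho * math.exp(math.lgamma(x) + math.lgamma(rho + 1)
                          - math.lgamma(x + rho + 1))

def fit_rho(counts):
    """Maximum-likelihood rho for {size: count} (counts may be fractional)."""
    n, x_max = sum(counts.values()), max(counts)
    def score(rho):
        # rho * dlogL/drho, strictly decreasing in rho
        total, partial = n, 0.0
        for x in range(1, x_max + 1):
            partial += rho / (rho + x)
            total -= counts.get(x, 0) * partial
        return total
    lo, hi = 1e-6, 1e3
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if score(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def em_collaborative_fit(counts, eps=1e-6, max_iter=1000):
    """Alternate E and M steps until rho changes by less than eps.

    Only the count at size 1 is treated as latent; counts for sizes >= 2
    are kept as observed. Returns (rho_col, predicted collaborative singles).
    """
    larger = {x: c for x, c in counts.items() if x >= 2}
    m = sum(larger.values())
    rho = fit_rho(counts)              # start from the fit to all projects
    singles = rho * m
    for _ in range(max_iter):
        # E step: expected number of collaborative single-developer projects
        singles = rho * m
        # M step: refit rho with the predicted single-developer count
        new_rho = fit_rho({**larger, 1: singles})
        if abs(new_rho - rho) < eps:
            return new_rho, singles
        rho = new_rho
    return rho, singles

if __name__ == "__main__":
    # Synthetic counts: 40000 projects following rho = 3, plus 50000 extra
    # single-developer projects standing in for non-collaborative ones
    counts = {x: 40000 * pmf(x, 3.0) for x in range(1, 201)}
    counts[1] += 50000
    print("rho_all:", fit_rho(counts))
    print("rho_col, collaborative singles:", em_collaborative_fit(counts))
```

On such synthetic input the uncorrected fit is pulled upward by the surplus of single-developer projects, while the iterated fit returns to the value used to generate the sizes $x \geq 2$.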

Now, with the corrected size distribution $f^{c}(x,N)$, we can apply our rigorous goodness-of-fit test to each
monthly snapshot. The results for the $p$-values are shown in Fig. 9. We see that the null hypothesis of the
Yule-Simon distribution as the empirical one cannot be rejected for any snapshot between late 2005 and late
2009. This gives us great confidence both in our modeling assumptions and in the proposed correction to
distinguish between collaborative and non-collaborative projects.

At the same time, it leads us to the question of what has changed after the end of 2009 to make the fits
invalid. As we observe, from 2010 the significance level goes down considerably, although it is rarely
exactly zero. In order to better understand the dynamics from 2010, we will develop another conjecture in
the following section.


5.2 Estimations for $p_0$

In the previous section we demonstrated that the Yule-Simon distribution (and the underlying model)
is a valid candidate for describing the empirical dynamics of collaborative projects at least for certain
time intervals. We can link our findings for the size distribution, which refer to the systemic level, back
to our assumptions for the microscopic dynamics. Recall that in the model of Simon [?] there is only
one parameter, $p_0$, that decides whether new developers found new projects, as opposed to joining existing
ones. This parameter, as far as the theory goes, is directly linked to the exponent $\rho$ of the distribution
via Eq. (19). Assuming $\rho = 3$, Eq. (19) would give $p_0 = 2/3$, which is quite high if compared, e.g., to the
firm size distribution where $\rho$ is about 1.2 and $p_0$ about 0.16.

Figure 10: Estimation of the parameter $p_0$ from the monthly entry rates of projects and developers,
both for all projects/developers ($p_0^{\mathrm{all}}(t)$, yellow dots) and for collaborative projects/developers only
($p_0^{\mathrm{col}}(t)$, red dots). The full line is the median of $p_0^{\mathrm{col}}(t)$ at 0.6128.


In this section we want to find an independent way of estimating $p_0$. We recall that $p_0$ essentially describes
how much of the total growth goes into newly founded projects. So, if $G_{\mathrm{tot}}$ is the total growth spent on
all projects and $G_1$ is the growth spent on new projects during a given time interval, then
$p_0 = G_1/G_{\mathrm{tot}}$ [13].


In our empirical data, $G_{\mathrm{tot}}(t)$ is measured by the total number of developers per month that enter SF,
$\Delta N_d(t)$, which is shown in Fig. ?? (right). $G_1(t)$, on the other hand, is given by the total number of newly
founded projects per month, $\Delta N_p(t)$, shown in Fig. ?? (left). So, we just divide these monthly entry rates
to obtain $p_0(t) = \Delta N_p(t)/\Delta N_d(t)$ from the empirical data. We do this for the two different datasets:
(a) for all projects, and (b) for collaborative projects. For the latter, we need to correct also $\Delta N_d(t)$
because we have to exclude those developers that joined SF to establish a non-collaborative project. This
correction is done based on the empirical data.
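
The division of the two monthly entry rates is straightforward to express in code. The snippet below is a minimal sketch; the monthly counts are hypothetical placeholders for illustration and are not taken from the SF data set.

```python
from statistics import median

def estimate_p0(new_projects, new_developers):
    """Monthly p0(t) = dN_p(t) / dN_d(t); months without new developers are skipped."""
    return [p / d for p, d in zip(new_projects, new_developers) if d > 0]

if __name__ == "__main__":
    # Hypothetical monthly entry counts, for illustration only
    new_projects = [1200, 1300, 1250, 1400, 2600]
    new_developers = [2000, 2100, 1900, 2300, 2200]

    p0_t = estimate_p0(new_projects, new_developers)
    print("median p0:", median(p0_t))
    # Months with p0(t) > 1 cannot result from newly entering developers alone
    print("months with p0 > 1:", [i for i, v in enumerate(p0_t) if v > 1])
```

Using the median rather than the mean keeps a few extreme months from dominating the estimate.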


The results are plotted in Fig. 10 for the two different data sets: (a) $p_0^{\mathrm{all}}(t)$ for all projects and
developers, and (b) $p_0^{\mathrm{col}}(t)$ for the collaborative projects and developers. As expected from the above
discussion, we see differences for the time periods before and after 2010. In fact, we see that after 2010
$p_0^{\mathrm{all}}(t)$ consistently has values above 1, which cannot be realized under the assumption that only newly
entering developers establish new projects. If the latter holds, $p_0$ is necessarily bound to values below 1.
To explain this, we arrive at our second conjecture that after 2010 an increasing number of established
developers started to found new projects. This, however, is not considered in our modeling assumptions;
therefore the prediction derived from the model necessarily fails, as we also see from the low $p$-values in
Fig. 9.

If we look at collaborative projects, we see that $p_0^{\mathrm{col}}(t)$ follows the same trend as $p_0^{\mathrm{all}}(t)$, just with a shift
toward lower values. To find out whether this shift results from a higher entry rate of collaborative projects
or a lower entry rate of collaborative developers, we have plotted $\Delta N_p(t)$ in Fig. ?? (right). We verify that
$\Delta N_p(t)$ for collaborative projects is almost constant over all years, i.e., the increase in $p_0^{\mathrm{col}}(t)$ is from the
lower entry rate of collaborative developers. Because of the similar trend compared to $p_0^{\mathrm{all}}(t)$ after 2010,
we also keep our conjecture that an increasing number of established developers started to found new
collaborative projects. This violation of our modeling assumption can be confirmed also in Fig. 9, where
we see that the $p$-values for the goodness-of-fit test break down after the end of 2009 for the collaborative
projects. If we take the median of $p_0^{\mathrm{col}}(t)$ over the whole time period, we find $p_0^{\mathrm{col}} \approx 0.61$, which is in
remarkable agreement with the theoretical value $p_0 = 2/3$ obtained from $\rho = 3$.

Finally, we want to discuss an additional issue in comparing the empirical and theoretical results.
By means of the EM method, we found a way to correct $f(1,N)$ such that only collaborative projects
are taken into account. The corrected value $f^{c}(1,N)$ can also be related to the empirical number of
collaborative single-developer projects. I.e., by tracing their history, we can identify in the data set those
single-developer projects that grew at a later point in time. Their monthly growth rate $\Delta N_p$ is already
plotted in Fig. ?? (right) (blue line). We observe that there is a shift between the predicted growth rate
of collaborative single-developer projects (green line) and the empirical one, which slightly increases over
time from values of 1.5 to 2.

The cause for this mismatch should not be attributed to the predicted value, but rather to the empirical 
one because it underestimates the number of collaborative projects for the following reason. Our empirical 
classification of collaborative vs. non-collaborative projects is based on their observed growth, only. If we 
classify projects as non-collaborative, we make a mistake because projects may still grow later in time, 
but this is just not observed. This mistake becomes larger the closer we come to the end of the data 
set. Therefore, to estimate the magnitude of the mistake, we should look at the oldest projects in the 
data set, which date back to November/December 2004 (when the data became reliable). We verify that 
the interval between the time when these projects were created and the time when a second developer 
joined can be well described by an exponential distribution, $f(t) = \lambda e^{-\lambda t}$. The expected value of this
distribution is $E[t] = 1/\lambda$, which can be also measured from the data for the almost 3000 projects created
in these two months, to yield roughly 450 days. The cumulative distribution gives us the probability that
those projects will grow before the 450 days as $\Pr(t < E[t]) \approx 0.6$.

This can now be used to calculate the mistake made by classifying projects as non-collaborative during
the last 450 days of the data set. It is about 40%, i.e., we miss 40% of the collaborative projects in our
estimation. The correction for the observed number thus will be a factor of about 1.6 (i.e., $1/0.63$, using
the unrounded probability $1 - e^{-1} \approx 0.63$), which is very close to the observed mismatch of 1.5 to 2. I.e.,
with this rough estimation we can explain fairly well the
observed difference between the empirical and EM estimates.
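
The arithmetic behind this correction can be reproduced in a few lines. The sketch assumes, as in the text, an exponential waiting time with a mean of 450 days; the function name is chosen here for illustration.

```python
import math

def prob_growth_within(t_days, mean_wait_days):
    """Pr(T < t) for an exponential waiting time with the given mean."""
    return 1.0 - math.exp(-t_days / mean_wait_days)

if __name__ == "__main__":
    mean_wait = 450.0  # days, as reported in the text
    p_seen = prob_growth_within(mean_wait, mean_wait)  # equals 1 - exp(-1)
    print("share of collaborative projects observed to grow:", p_seen)
    print("share missed:", 1.0 - p_seen)
    print("correction factor:", 1.0 / p_seen)
```

Evaluating the cumulative distribution at its own mean always gives $1 - e^{-1}$, independent of the value of the mean, so the correction factor does not depend on the precise estimate of 450 days.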

To conclude our discussion so far, we find that the Yule-Simon distribution is a valid candidate to describe
the empirical size distribution of collaborative projects. However, this validity is constrained to certain
time periods for which the underlying assumptions of the model can be justified. We observe that in
later time periods established developers started to create additional projects. This was not considered
in the model assumptions to derive the Yule-Simon distribution, where only newly entering developers
are considered to create new projects.

5.3 Extensions 

In this paper we have investigated to what extent an established model for the entry of new firms
and the growth of existing ones, originally developed by Simon [?], can be directly applied to OSS
communities, where new projects are founded by new developers and existing projects grow by attracting
new developers.

The advantages of using the Simon model go alongside the disadvantages resulting from the limitations
of its underlying assumptions. We discuss them here to give hints for further improvements
of the model, because we noticed that the Yule-Simon distribution, over large time intervals, has
proven to be a promising candidate for the size distribution of collaborative projects. However, each of the
suggestions discussed below will modify the original model such that the analytical approach developed
can no longer be used and closed-form solutions will most likely not be derived.

The first suggestion relates to a known criticism of the Yule-Simon model, namely that not more than
one project can be founded or grow at each time step. This poses no problem as long as one is
interested in the asymptotic size distribution. But in order to come up with more realistic dynamics
before the steady state is reached, one should consider that projects can be founded and grow in parallel.
In particular, one has to consider concurrent activities, i.e., that not only new developers perform an
action, but also established developers can decide to have more than one project, e.g., by founding a
new one. As a consequence, $p_0$, the probability to found a new project, should become a heterogeneous
parameter, to better reflect individual motivations of developers. By this, we could further distinguish
between different personalities, e.g., founders, who prefer to start new projects, and contributors, who
prefer to join existing projects.

As a second suggestion, we can consider that new developers may have a preference for larger or smaller
projects, i.e., the probability to choose from size class $(x - 1)$ is also proportional to $(x - 1)^{\alpha}$. This was
already implemented in the dynamic assumptions of Eq. (??) as a new element not discussed in [12]. $\alpha = 0$
would recover the case of no size preference (as, e.g., also used to describe the firm growth dynamics [?]),
$\alpha = 1$ would be a preference directly proportional to the existing size, and $\alpha < 0$ would indicate that
projects become less attractive with increasing size, probably because of coordination and integration
efforts. Hence, the additional parameter $\alpha$ allows us to consider various (monotonic) size-dependent
preferences. To account for optimal project sizes, this dependence should also be non-monotonic. In
our formal approach, we have set $\alpha = 1$ to favor the mathematical approach by which the project size
distribution was derived. Without this restriction to closed-form solutions, an agent-based simulation
could explore the impact of size-dependent preferences on the project size distribution.
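
A minimal sketch of such an agent-based simulation is given below. It is a hypothetical illustration of the suggestion, not part of the original analysis: the function name and parameter values are assumptions, and a joining developer picks project $i$ with probability proportional to its current size raised to the power $\alpha$.

```python
import random

def simulate_size_preference(n_steps, p0, alpha, seed=0):
    """Agent-based variant of the growth model with joining weight size**alpha."""
    rng = random.Random(seed)
    sizes = [1]        # the first developer founds the first project
    weights = [1.0]    # weights[i] = sizes[i] ** alpha
    for _ in range(n_steps - 1):
        if rng.random() < p0:
            # found a new single-developer project
            sizes.append(1)
            weights.append(1.0)
        else:
            # join an existing project, chosen with weight size**alpha
            i = rng.choices(range(len(sizes)), weights=weights)[0]
            sizes[i] += 1
            weights[i] = float(sizes[i]) ** alpha
    return sizes

if __name__ == "__main__":
    for alpha in (-1.0, 0.0, 1.0):
        sizes = simulate_size_preference(3000, 0.3, alpha, seed=1)
        print("alpha =", alpha, "projects:", len(sizes), "largest:", max(sizes))
```

With $\alpha = 1$ the process reduces to the proportional preference used in the derivation above, while negative $\alpha$ spreads new developers over small projects and suppresses the heavy tail.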

As a third suggestion, we should allow contributors to switch between projects, in order to better utilize
their skills. This would also have consequences for the knowledge spillovers between projects, which
is an important consideration for management science and economics. On the formal modeling level,
such additional assumptions would change the rate equation approach developed in Sect. 4.2 by adding
additional terms for the shrinkage of existing projects (developers leave) and for the growth of existing
projects by other than newly registered developers.

A different set of possible extensions points to the way the developer activity is counted. So far, we have
assumed that a newly entered developer immediately founds a new project or joins an existing one. But
developers may have joined SF for other reasons, e.g., for getting better access to code to re-use
outside of SF. This may lead to a mismatch between the number of developers entering SF per month
and the assumed activity of these developers inside SF. Along the same line, in our analysis developers are
assigned to projects, which is indicated by a link, and our modeling approach assumes that such links do
not change. However, links do not necessarily mean that developers are actively working for the project;
they are only a first (and not necessarily the best) approximation of contributions. Here, we note that
more refined measures are already available, which are discussed in a subsequent paper [?]. But these
measures largely depend on information that is not available from Sourceforge.net, so we will have to
resort to Github.com.

We conclude that, even with these limitations on the SF data, our analysis about the launch of new
projects and their subsequent growth is one of the largest investigations on such data to date. It resulted
in new findings about the project size distribution and the degree distribution of developers, about their
entry and exit rates, and about the preferred usage of programming languages. In order to better understand the
dynamics that generated these systemic properties, we utilized an established economic model that in
this paper has proven to be a valuable candidate also for the modeling of socio-technical systems such as
OSS communities. At the same time, the model also revealed some shortcomings, which helps us to better
understand the role of underlying assumptions and their limitations. In the end, both the (positive)
confirmation and the (negative) rejection of modeling assumptions generate important insights
into the dynamics of real socio-technical systems.